{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.3"},"colab":{"name":"lda_80_10000_0.30.ipynb","provenance":[],"collapsed_sections":[],"toc_visible":true}},"cells":[{"cell_type":"markdown","metadata":{"id":"5Ngy4ScKRlV_","colab_type":"text"},"source":["# Stage 2. LDA Topic modeling\n","\n","Code for performing topic modeling on the science fiction corpus. The novels are chunked at the page level and modeled with the scikit-learn implementation of Latent Dirichlet allocation (LDA).\n","\n","This code was written and provided by Matthew Wilkens. I flag only the few lines I added at the end of the notebook to store the results of the algorithm."]},{"cell_type":"markdown","metadata":{"id":"0Qa3rxpjrQOQ","colab_type":"text"},"source":["## 1. Imports, variables and downloads"]},{"cell_type":"code","metadata":{"id":"iNByf8gVRlWD","colab_type":"code","colab":{}},"source":["%matplotlib inline\n","\n","import pandas as pd\n","import os\n","import sys\n","import numpy as np\n","import glob\n","import gzip\n","\n","from   htrc_features import FeatureReader, utils as frutils\n","from   nltk.stem import WordNetLemmatizer\n","from   sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer\n","from   sklearn.decomposition import LatentDirichletAllocation\n","\n","\n","# Directories for input and output\n","figDir = 'figures'\n","resultsDir = 'results'\n","inputDir = 'inputs'\n","\n","# Full corpus data can be large; make it easy to stash outside GitHub/Google\n","bigDir = '.' 
# Base directory for large files\n","htrcefDir = os.path.join(bigDir, 'htrcef') # HTRC-EF JSONs\n","corpus_file = os.path.join(bigDir, 'corpus.txt.gz') # Text version of corpus\n","corpus_ids_file = os.path.join(bigDir, 'corpus_ids.txt.gz') # Corpus identifiers\n","corpus_file_trim = os.path.join(bigDir, 'corpus_trimmed.txt.gz') # Trimmed version of corpus\n","corpus_ids_file_trim = os.path.join(bigDir, 'corpus_trimmed_ids.txt.gz') # Trimmed corpus ids\n","logfile = os.path.join(bigDir, 'lda.log') # Log file (for Gensim)\n","\n","os.makedirs(figDir, exist_ok=True)\n","os.makedirs(resultsDir, exist_ok=True)\n","os.makedirs(inputDir, exist_ok=True)\n","os.makedirs(htrcefDir, exist_ok=True)\n","\n","# Variables that affect processing\n","reprocess = True # Discard very short pages?\n","stoplist_file = os.path.join(inputDir, 'stopwords-underwood-goldstone.txt')"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"a1JzKg4ZRlWM","colab_type":"text"},"source":["### Science fiction HTIDs\n"]},{"cell_type":"code","metadata":{"id":"TYkxxHjZRlWO","colab_type":"code","outputId":"2e6e4614-974e-4f07-b641-a3b04910ed72","colab":{}},"source":["# List of HTIDs to use\n","import csv\n","\n","def creating_volid_list(csv_file):\n","    \n","    list_htids = []\n","    \n","    with open(csv_file, 'r', encoding='utf-8') as open_csv:\n","        dict_csv = csv.DictReader(open_csv)\n","        for row in dict_csv:\n","            htid = str(row[\"htid\"])\n","            if \"None\" not in htid and htid != \"\":\n","                list_htids.append(htid)\n","\n","        return list_htids\n","    \n","volids = creating_volid_list(\"metadata_with_htids.csv\")\n","\n","print(len(volids))"],"execution_count":0,"outputs":[{"output_type":"stream","text":["331\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"KOeLcdQJRlWW","colab_type":"text"},"source":["### Download HTRC-EF 
files"]},{"cell_type":"code","metadata":{"id":"E2YOEfHLRlWX","colab_type":"code","outputId":"70ae438e-297b-40d0-8c08-2cb32b710375","colab":{}},"source":["# Download the extracted features files for all volumes in the corpus\n","frutils.download_file(htids=volids, outdir=htrcefDir)"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["(0, None)"]},"metadata":{"tags":[]},"execution_count":3}]},{"cell_type":"markdown","metadata":{"id":"ccfJ1_8-RlWd","colab_type":"text"},"source":["## 2. Process corpus\n","\n","This part of the code processes the text corpus so that it can be streamed through the algorithm. The novels are chunked by page. Two gzipped flat text files are created: the first holds, on each line, the htid of a novel and a page number; the second holds the words of the corresponding page as a single-line string."]},{"cell_type":"markdown","metadata":{"id":"888RNDnBzFSs","colab_type":"text"},"source":["### Cleaning and turning the corpus into lists of strings\n","\n","Short pages and the first and last few pages of each novel are discarded, as they are likely paratext. Only the tokens tagged with the selected parts of speech are retained. Stopwords are removed. 
The remaining words are lemmatized."]},{"cell_type":"code","metadata":{"id":"UQo6CMfORlWf","colab_type":"code","colab":{}},"source":["# Turn EF volumes into one-page-per-line corpus\n","# Remove stopwords, select parts of speech, lemmatize, and lowercase\n","\n","# Penn treebank tags to keep\n","pos_to_include = [\n","    'FW',  # foreign\n","    'JJ',  # adjectives\n","    'JJR',\n","    'JJS',\n","    'MD',  # modal\n","    'NN',  # nouns (not proper)\n","    'NNS',\n","    'RB',  # adverbs\n","    'RBR',\n","    'RBS',\n","    'VB',  # verbs\n","    'VBD',\n","    'VBG',\n","    'VBN',\n","    'VBP',\n","    'VBZ'\n","]\n","\n","\n","# Functions to work with EF volumes\n","def encode_volid(volid, direction='path'):\n","    '''\n","    Transform htid into filename encoded version and vice versa\n","    '''\n","    encoding_fixes = {'+':':', '=':'/'}\n","    if direction=='path':\n","        encoding_fixes = {v:k for k,v in encoding_fixes.items()}\n","    for key in encoding_fixes:\n","        volid = volid.replace(key, encoding_fixes[key])\n","    return(volid)\n","\n","\n","# Translate Penn->WordNet PoS tags\n","#  Need WordNet PoS tags for lemmatizer\n","def get_wordnet_pos(treebank_tag):\n","    from nltk.corpus import wordnet\n","    if treebank_tag.startswith('J'):\n","        return wordnet.ADJ\n","    elif treebank_tag.startswith('V'):\n","        return wordnet.VERB\n","    elif treebank_tag.startswith('M'):\n","        return wordnet.VERB\n","    elif treebank_tag.startswith('R'):\n","        return wordnet.ADV\n","    else:\n","        return wordnet.NOUN\n","    \n","# Transform EF page to space-delimited string\n","def efpage2doc(page, cutoff=50):\n","    doc = ''\n","    if page.token_count() >= cutoff:\n","        tokens = page.tokenlist(case=False).query('pos in @pos_to_include')\n","        for token in tokens.itertuples():\n","            word = token.Index[2]\n","            if word not in stoplist:\n","                pos = 
get_wordnet_pos(token.Index[3])\n","                word = lemmatizer.lemmatize(word, pos=pos)\n","                for i in range(token.count):\n","                    doc += word+' '\n","    return(doc.strip())\n","\n","# Transform EF volume into list of one-line page-level document strings\n","def efvol2docs(vol, skip_first=10, skip_last=10, min_page_tokens=50):\n","    docs = []\n","    pages = []\n","    for page in vol:\n","        if (skip_first < int(page.seq) < (vol.page_count-skip_last)):\n","            docs.append(efpage2doc(page, min_page_tokens))\n","            pages.append(page.seq)\n","    return(pages, docs)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"nCOalsgnRlWl","colab_type":"code","outputId":"2b517691-b19e-438c-c890-51e96582646f","colab":{}},"source":["# Read stoplist and set up\n","min_page_tokens = 50 # Ignore pages with fewer than this many tokens\n","skip_first_pages = 10 # Skip first and last pages of each book (paratext)\n","skip_last_pages = 10\n","\n","stoplist = [line.strip() for line in open(stoplist_file)]\n","stoplist = set(stoplist)\n","print(\"Words in stoplist:\", len(stoplist))"],"execution_count":0,"outputs":[{"output_type":"stream","text":["Words in stoplist: 6048\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"lFQxXXf2RlWq","colab_type":"text"},"source":["### Perform first processing pass\n","\n","The functions defined above are used by the following cell, which creates two gzipped text files. In `corpus_file`, each line is one page of a novel. In `corpus_ids_file`, the corresponding line holds the htid of the novel and the page number. 
As a result, it is possible to later reconstruct which pages belong to which books in what order.\n"]},{"cell_type":"code","metadata":{"scrolled":true,"id":"7W4uC2hfRlWs","colab_type":"code","outputId":"31d49461-7807-4b0d-f45d-beea8528c067","colab":{}},"source":["%%time\n","# Perform corpus processing\n","lemmatizer = WordNetLemmatizer() # Initialize lemmatizer\n","vols_processed = 0 # Count volumes processed\n","vols_error = [] # Keep track of volumes with errors\n","\n","# Read HTRC-EF files, process, write out as single (gzipped) corpus file\n","with gzip.open(corpus_ids_file, 'wt') as fi:\n","    with gzip.open(corpus_file, 'wt', encoding='utf-8') as fd:\n","        for volid in volids:\n","            try:\n","                vol = FeatureReader(\n","                    os.path.join(\n","                        htrcefDir,\n","                        f'{encode_volid(volid)}.json.bz2'\n","                    )).first()\n","                pages, docs = efvol2docs(\n","                    vol, \n","                    skip_first=skip_first_pages,\n","                    skip_last=skip_last_pages,\n","                    min_page_tokens=min_page_tokens\n","                )\n","                for doc in docs:\n","                    fd.write(doc+'\\n')\n","                for page in pages:\n","                    fi.write(f'{volid} {page}\\n')\n","            except Exception:\n","                vols_error.append(volid)\n","            vols_processed += 1\n","\n","print(f'Processed {vols_processed} vols with {len(vols_error)} errors')"],"execution_count":0,"outputs":[{"output_type":"stream","text":["Processed 331 vols with 0 errors\n","CPU times: user 35min 15s, sys: 2.42 s, total: 35min 17s\n","Wall time: 35min 18s\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"njmAsNWoRlWx","colab_type":"text"},"source":["### Perform second processing pass\n","\n","It's possible for the above corpus setup to produce some page-level documents that are empty or very 
short. Such documents may cause errors in later steps, so we reprocess the corpus to remove them."]},{"cell_type":"code","metadata":{"id":"bOr4_jf2RlW1","colab_type":"code","colab":{}},"source":["corpus_to_use = corpus_file\n","ids_file = corpus_ids_file\n","\n","if reprocess:\n","    corpus_to_use = corpus_file_trim\n","    ids_file = corpus_ids_file_trim\n","\n","    # Reprocess corpus to remove empty and very short docs\n","    from collections import deque\n","    indices_to_delete = deque()\n","    counter = 0\n","    with gzip.open(\n","        corpus_file_trim,\n","        'wt', \n","        encoding='utf-8'\n","    ) as f:    \n","        for line in gzip.open(corpus_file, 'rt', encoding='utf-8'):\n","            if len(line.strip().split()) < 10:\n","                indices_to_delete.append(counter)\n","            else:\n","                f.write(line)\n","            counter += 1\n","    to_delete = indices_to_delete.copy()\n","    skip = to_delete.popleft() if to_delete else None  # None when no page needs removing\n","    counter = 0\n","    with gzip.open(\n","        corpus_ids_file_trim,\n","        'wt',\n","        encoding='utf-8'\n","    ) as f:\n","        for line in gzip.open(corpus_ids_file, 'rt', encoding='utf-8'):\n","            if counter == skip:\n","                if len(to_delete) > 0:\n","                    skip = to_delete.popleft()\n","                else:\n","                    pass\n","            else:\n","                f.write(line)\n","            counter += 1"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"lQSxNaF1_E3Q","colab_type":"text"},"source":["## 3. 
Core of the code"]},{"cell_type":"markdown","metadata":{"id":"6mO-AJdxRlW7","colab_type":"text"},"source":["### Vectorize corpus\n","\n","Turn texts into a document-term matrix with specified parameters."]},{"cell_type":"code","metadata":{"id":"3lJXv-ZVRlW9","colab_type":"code","outputId":"162cbae5-b367-4248-dd8a-03b3bec1fa49","colab":{}},"source":["# Streaming corpus, precomputed length\n","#  Use a streaming version of the corpus for memory efficiency\n","class StreamCorpus(object):\n","    def __init__(self, obs=None):\n","        i = 0\n","        for line in gzip.open(corpus_to_use, 'rt', encoding='utf-8'):\n","            i+=1\n","        self._data_len = i\n","    def __iter__(self):\n","        for line in gzip.open(corpus_to_use, 'rt', encoding='utf-8'):\n","            yield (line)\n","    def __len__(self):\n","        return self._data_len\n","    \n","\n","corpus = StreamCorpus()\n","ids = [line.strip() for line in gzip.open(ids_file, 'rt')]\n","\n","# LDA settings\n","max_df = 0.30         # Keep words that appear on no more than x fraction of pages\n","                      # This removes high-frequency words not already included on\n","                      # stopwords list.\n","\n","min_df = 3            # Keep words that appear on at least x total pages\n","                      # Removes very rare words\n","                      # Use larger value for larger corpora\n","\n","max_iter = 20         # Number of LDA passes over corpus\n","\n","n_features = 10000    # Max unique words to use in model\n","                      # Another way to remove very low-frequency words\n","                      # Use larger value for larger corpora\n","\n","n_components = 80     # Number of topics\n","                      # Use larger value for larger corpora\n","\n","# Regular vectorizer. 
Not memory-efficient, but easy to use\n","count_vectorizer = CountVectorizer(\n","    encoding='utf-8',\n","    max_df=max_df, \n","    min_df=min_df,\n","    max_features=n_features\n",")\n","\n","# Pick a vectorizer\n","vectorizer = count_vectorizer\n","\n","# Perform vectorization\n","tf = vectorizer.fit_transform(corpus)\n","print(tf.shape)"],"execution_count":0,"outputs":[{"output_type":"stream","text":["(93286, 10000)\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"8nv3gmDtRlXC","colab_type":"text"},"source":["### Perform LDA"]},{"cell_type":"code","metadata":{"id":"C8DSxuMIRlXE","colab_type":"code","outputId":"91a8980d-0ef3-4534-820a-0023590b8996","colab":{}},"source":["%%time\n","%env JOBLIB_TEMP_FOLDER=/tmp\n","\n","lda = LatentDirichletAllocation(\n","    n_components=n_components, \n","    max_iter=max_iter,\n","    learning_method='online',\n","    learning_offset=50.,\n","    random_state=0,\n","    n_jobs=1\n",")"],"execution_count":0,"outputs":[{"output_type":"stream","text":["env: JOBLIB_TEMP_FOLDER=/tmp\n","CPU times: user 0 ns, sys: 8 ms, total: 8 ms\n","Wall time: 6.63 ms\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"9scC5RayRlXJ","colab_type":"code","outputId":"fbef042d-46a7-426a-8c3e-a6acf371193a","colab":{}},"source":["lda.fit(tf)"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n","             evaluate_every=-1, learning_decay=0.7,\n","             learning_method='online', learning_offset=50.0,\n","             max_doc_update_iter=100, max_iter=20, mean_change_tol=0.001,\n","             n_components=80, n_jobs=1, n_topics=None, perp_tol=0.1,\n","             random_state=0, topic_word_prior=None,\n","             total_samples=1000000.0, verbose=0)"]},"metadata":{"tags":[]},"execution_count":28}]},{"cell_type":"markdown","metadata":{"id":"u2fxf_fl--VM","colab_type":"text"},"source":["## 4. 
Preliminary overview of the results"]},{"cell_type":"markdown","metadata":{"id":"Aj8vKHM8RlXO","colab_type":"text"},"source":["### Topic keywords"]},{"cell_type":"code","metadata":{"id":"JHgKTayuRlXP","colab_type":"code","outputId":"dfc618da-686a-4b35-86db-569b8f9038f9","colab":{}},"source":["n_top_words = 10     # Print this many words per topic\n","def print_top_words(model, feature_names, n_top_words):\n","    '''\n","    Print the top words associated with each modeled topic.\n","    '''\n","    index_words = []\n","    print(\"\\nTopics in LDA model:\\n==========\")\n","    for topic_idx, topic in enumerate(model.components_): #num of the topic and the topic itself\n","        message = \"Topic #%d: \" % topic_idx\n","        message += \" \".join([feature_names[i] #get word/feature from the vectorizer array\n","                             for i in topic.argsort()[:-n_top_words - 1:-1]])  #select the last n_top_words indices, which correspond to the top words in the topic\n","        print(message)\n","        index_words.append(message.split(' ')[2])\n","    print()\n","    return index_words\n","\n","#each topic in the model contains all the words/features in the same order as the array of words/features in the vectorizer\n","\n","tf_feature_names = vectorizer.get_feature_names()\n","topic_index_words = print_top_words(lda, tf_feature_names, n_top_words)"],"execution_count":0,"outputs":[{"output_type":"stream","text":["\n","Topics in LDA model:\n","==========\n","Topic #0: cell dragon monster beard suck jungle net smell dwarf beast\n","Topic #1: all act resume propose hostile aura suppress confuse determination trifle\n","Topic #2: black white wear fish red dress brown shirt blue hair\n","Topic #3: body spine jaw teeth mouth flesh bone gill corpse wan\n","Topic #4: technician interstellar stallion designer peach rating earthling motorcycle orgasm coastal\n","Topic #5: battle mad pretend chill console stable mask behave fighter skinny\n","Topic #6: shake head sit room smile 
chair table face nod turn\n","Topic #7: re ll ve want right let good maybe try ask\n","Topic #8: boat suit sand station pilot engine bridge drive launch cabin\n","Topic #9: read book write letter page story thumb chapter novel author\n","Topic #10: commander circuit an transmission sport monk electronic calculate reflex uneasily\n","Topic #11: run foot away head arm side pull again leg ground\n","Topic #12: love never child woman life give mother own world heart\n","Topic #13: water river north wind land ice island south dorsal snow\n","Topic #14: gene garbage cultural weekend priority vector cruciform baseball colonist cooperative\n","Topic #15: world life new history work people become science experience future\n","Topic #16: movie assignment jail bust hostage trash therapy shard uniformed cod\n","Topic #17: fin scale god earth nation head prince someday end spot\n","Topic #18: factory grand giggle ritual com discard cue oblique haven novice\n","Topic #19: mat legend booth pouch cosmic tick fingernail braid torso faction\n","Topic #20: quiet smoke armor flare boil fence rumble fifty mute clumsy\n","Topic #21: government case report use give suspect point term state obvious\n","Topic #22: kilometer centimeter detector zip headset swiss tournament gutted mull abortion\n","Topic #23: project quietly explanation examine await singer equation cable incident proof\n","Topic #24: meter handle winter tent punch cousin flip glove five compartment\n","Topic #25: completely crazy cost american camera mess japanese detect scheme owner\n","Topic #26: snort plague spell be concentration reluctantly annoy courtesy momentarily wizard\n","Topic #27: village match chin uniform hut guy handful eight off logical\n","Topic #28: work good people little eat much day use never food\n","Topic #29: fire gun burn weapon flame radio shoot rifle shot pulse\n","Topic #30: tree sky sun green cloud light blue forest white black\n","Topic #31: problem tap wagon car solution solve stalk rhythm 
preparation number\n","Topic #32: line ahead figure second attention follow move check steal guide\n","Topic #33: quot lodge emphasize sculpture clown specialize guitar fugue dimple indignantly\n","Topic #34: name frown calm snake heavily call grimace young highway bargain\n","Topic #35: paper file office political pattern copy record department system german\n","Topic #36: city street building guard people crowd gate stand walk men\n","Topic #37: call sir message ask wait send voice leave give minute\n","Topic #38: night sleep day dream hour bed morning wake last leave\n","Topic #39: bitch gonna bureaucrat discharge parlor dope optical boyfriend casino uh\n","Topic #40: ship screen captain crew space sail deck instrument orbit total\n","Topic #41: drunk talent stamp armored neutral dont goose genuinely jewelry driveway\n","Topic #42: men speak demand father lord king well word priest great\n","Topic #43: face light stand seem move turn again moment felt voice\n","Topic #44: television gotta mutation aint microphone forefinger temporal vertebra colorful chore\n","Topic #45: hear repeat sound speech listen word audience voice final perform\n","Topic #46: count lucky president vote reserve secretary belief conviction quarrel astonish\n","Topic #47: horse ride mount defense castle rider queen mouse saddle rid\n","Topic #48: ask seem find much well something question perhaps want nothing\n","Topic #49: baby breakfast cigarette counter kitchen cat wash dish plate bathroom\n","Topic #50: wall stone circle roof foot tower climb small large depth\n","Topic #51: door room open window floor wall light house step lock\n","Topic #52: wrist pad translate volunteer ramp crow salute unpleasant shuffle mill\n","Topic #53: boy brother face old hear voice laugh hold sword arm\n","Topic #54: program coffee tank egg hatch beer milk pot salt chicken\n","Topic #55: reluctant eunuch taut intrigue ironic grandson clack scold chute gasoline\n","Topic #56: law control self gear sex murder 
crime patient characteristic definitely\n","Topic #57: girl bus bug inquire sympathy maid blonde blend childish am\n","Topic #58: great much well high become general regard seem rather power\n","Topic #59: matrix changeling pod ranking slither technological configuration handhold calorie orient\n","Topic #60: planet system world year space control power star earth machine\n","Topic #61: number length small large specie type head female adult animal\n","Topic #62: woman nose men doctor frighten skin hair okay dress wear\n","Topic #63: computer percent cm helicopter dimensional skid klick cheekbone rectangular crawler\n","Topic #64: unit tape element shuttle model technique cycle area caution artifact\n","Topic #65: year old day young family home first child name call\n","Topic #66: human race alien being creature mind language civilization religion primitive\n","Topic #67: target cop status sprawl trouser loyal surprising pilgrim translation photo\n","Topic #68: dog kid snap data plane elevator slam port fantasy gut\n","Topic #69: crap makeup conditioning recoil overload canadian trauma khaki overgrown tipped\n","Topic #70: recorder horde jus vomit caravan flail triple intersection boost refrigerator\n","Topic #71: honor bastard clone sled apprentice glower massage ornate downstream morale\n","Topic #72: money buy pay sell price dollar market gold coin bill\n","Topic #73: war soldier officer army enemy men prisoner leader peace command\n","Topic #74: death kill dead fight life own power die find still\n","Topic #75: virus nuclear awareness genetic grid seep heartbeat mutant powered airplane\n","Topic #76: require usual knight security don religious summon ﬁrst royal can\n","Topic #77: drink glass wine table cup bottle water tea card sip\n","Topic #78: clan rice airlock dolphin tricky marking tribal administrative torturer flagship\n","Topic #79: mile ray rock mountain range road flower side low 
line\n","\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"voArokNpRlXU","colab_type":"text"},"source":["### Scoring\n","\n","These scores are not used here, but they could help determine the \"best\" number of topics: fit several models and compare their `score()` (higher is better) and `perplexity()` (lower is better) outputs. Values are meaningful only in comparison to other models of the same data. There is no abstractly \"good\" score."]},{"cell_type":"code","metadata":{"id":"oycFKyAwRlXW","colab_type":"code","outputId":"655c66a4-b3b1-4ef4-f209-2f4a434289de","colab":{}},"source":["print(\"Log-likelihood score:\", round(lda.score(tf), 2))\n","print(\"Perplexity:\", round(lda.perplexity(tf), 2))"],"execution_count":0,"outputs":[{"output_type":"stream","text":["Log-likelihood score: -96275597.05\n","Perplexity: 6493.67\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"WaBE29xtRlXa","colab_type":"text"},"source":["### Doc-topic matrix\n","\n","Create a pandas dataframe containing topic fractions for each document. 
Each \"document\" is one page of a volume for our purposes."]},{"cell_type":"code","metadata":{"id":"eluBRKfuRlXc","colab_type":"code","outputId":"d6007846-118c-4d14-b4bf-c14ba0f45c79","colab":{}},"source":["# Transform input vectors to topics\n","lda_output = lda.transform(tf)\n","\n","# column names\n","topicnames = [f\"t{str(i)} {topic_index_words[i]}\" for i in range(lda.n_components)]\n","\n","# index names\n","docnames = ids\n","\n","# Make the dataframe\n","dtm = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)\n","\n","# Get dominant topic for each document\n","dominant_topic = np.argmax(dtm.values, axis=1)\n","dtm['dominant_topic'] = dominant_topic\n","\n","# Add htid and page number for each document\n","htids = []\n","pagenos = []\n","for docid in dtm.index:\n","    htid, pageno = docid.split()\n","    htids.append(htid)\n","    pagenos.append(pageno)\n","dtm['htid'] = htids\n","dtm['page'] = pagenos\n","\n","print(\"Sample rows of the document-topic matrix:\")\n","display(dtm.sample(10))"],"execution_count":0,"outputs":[{"output_type":"stream","text":["Sample rows of the document-topic matrix:\n"],"name":"stdout"},{"output_type":"display_data","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>t0 cell</th>\n","      <th>t1 all</th>\n","      <th>t2 black</th>\n","      <th>t3 body</th>\n","      <th>t4 technician</th>\n","      <th>t5 battle</th>\n","      <th>t6 shake</th>\n","      <th>t7 re</th>\n","      <th>t8 boat</th>\n","      <th>t9 read</th>\n","      <th>...</th>\n","      <th>t73 war</th>\n","      <th>t74 death</th>\n","      
<th>t75 virus</th>\n","      <th>t76 require</th>\n","      <th>t77 drink</th>\n","      <th>t78 clan</th>\n","      <th>t79 mile</th>\n","      <th>dominant_topic</th>\n","      <th>htid</th>\n","      <th>page</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>mdp.39015020711092 00000296</th>\n","      <td>0.01</td>\n","      <td>0.0</td>\n","      <td>0.03</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.03</td>\n","      <td>0.00</td>\n","      <td>0.02</td>\n","      <td>0.06</td>\n","      <td>0.0</td>\n","      <td>0.04</td>\n","      <td>43</td>\n","      <td>mdp.39015020711092</td>\n","      <td>00000296</td>\n","    </tr>\n","    <tr>\n","      <th>mdp.49015002042985 00000127</th>\n","      <td>0.06</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.11</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>11</td>\n","      <td>mdp.49015002042985</td>\n","      <td>00000127</td>\n","    </tr>\n","    <tr>\n","      <th>uc1.32106002188628 00000300</th>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.08</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.23</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>43</td>\n","      <td>uc1.32106002188628</td>\n","      
<td>00000300</td>\n","    </tr>\n","    <tr>\n","      <th>mdp.39015021474286 00000223</th>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.04</td>\n","      <td>0.01</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.06</td>\n","      <td>43</td>\n","      <td>mdp.39015021474286</td>\n","      <td>00000223</td>\n","    </tr>\n","    <tr>\n","      <th>mdp.39015005343515 00000218</th>\n","      <td>0.02</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.05</td>\n","      <td>0.22</td>\n","      <td>0.01</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.06</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.01</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>7</td>\n","      <td>mdp.39015005343515</td>\n","      <td>00000218</td>\n","    </tr>\n","    <tr>\n","      <th>mdp.39015034647167 00000091</th>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>28</td>\n","      <td>mdp.39015034647167</td>\n","      <td>00000091</td>\n","    </tr>\n","    <tr>\n","      <th>inu.30000082136767 00000056</th>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.16</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      
<td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.06</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.01</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>48</td>\n","      <td>inu.30000082136767</td>\n","      <td>00000056</td>\n","    </tr>\n","    <tr>\n","      <th>mdp.39015017656870 00000103</th>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.01</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.03</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.15</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>48</td>\n","      <td>mdp.39015017656870</td>\n","      <td>00000103</td>\n","    </tr>\n","    <tr>\n","      <th>mdp.39015034512692 00000105</th>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.02</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.03</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.04</td>\n","      <td>0.00</td>\n","      <td>0.01</td>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>43</td>\n","      <td>mdp.39015034512692</td>\n","      <td>00000105</td>\n","    </tr>\n","    <tr>\n","      <th>mdp.39015082679633 00000041</th>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>0.01</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.00</td>\n","      <td>0.05</td>\n","      <td>0.04</td>\n","      <td>0.00</td>\n","      <td>...</td>\n","      <td>0.0</td>\n","      <td>0.12</td>\n","      <td>0.00</td>\n","      
<td>0.01</td>\n","      <td>0.00</td>\n","      <td>0.0</td>\n","      <td>0.00</td>\n","      <td>60</td>\n","      <td>mdp.39015082679633</td>\n","      <td>00000041</td>\n","    </tr>\n","  </tbody>\n","</table>\n","<p>10 rows × 83 columns</p>\n","</div>"],"text/plain":["                             t0 cell  t1 all  t2 black  t3 body  \\\n","mdp.39015020711092 00000296     0.01     0.0      0.03     0.00   \n","mdp.49015002042985 00000127     0.06     0.0      0.00     0.00   \n","uc1.32106002188628 00000300     0.00     0.0      0.00     0.00   \n","mdp.39015021474286 00000223     0.00     0.0      0.00     0.00   \n","mdp.39015005343515 00000218     0.02     0.0      0.00     0.00   \n","mdp.39015034647167 00000091     0.00     0.0      0.00     0.00   \n","inu.30000082136767 00000056     0.00     0.0      0.16     0.00   \n","mdp.39015017656870 00000103     0.00     0.0      0.00     0.00   \n","mdp.39015034512692 00000105     0.00     0.0      0.00     0.00   \n","mdp.39015082679633 00000041     0.00     0.0      0.00     0.01   \n","\n","                             t4 technician  t5 battle  t6 shake  t7 re  \\\n","mdp.39015020711092 00000296           0.00       0.00      0.00   0.00   \n","mdp.49015002042985 00000127           0.00       0.00      0.00   0.11   \n","uc1.32106002188628 00000300           0.00       0.00      0.00   0.08   \n","mdp.39015021474286 00000223           0.00       0.00      0.00   0.00   \n","mdp.39015005343515 00000218           0.00       0.00      0.05   0.22   \n","mdp.39015034647167 00000091           0.00       0.00      0.00   0.00   \n","inu.30000082136767 00000056           0.00       0.00      0.00   0.00   \n","mdp.39015017656870 00000103           0.01       0.00      0.00   0.03   \n","mdp.39015034512692 00000105           0.00       0.02      0.00   0.00   \n","mdp.39015082679633 00000041           0.00       0.00      0.00   0.05   \n","\n","                             t8 boat  t9 read    ...     
t73 war  t74 death  \\\n","mdp.39015020711092 00000296     0.00     0.00    ...         0.0       0.03   \n","mdp.49015002042985 00000127     0.00     0.00    ...         0.0       0.00   \n","uc1.32106002188628 00000300     0.00     0.00    ...         0.0       0.23   \n","mdp.39015021474286 00000223     0.00     0.00    ...         0.0       0.04   \n","mdp.39015005343515 00000218     0.01     0.00    ...         0.0       0.06   \n","mdp.39015034647167 00000091     0.00     0.00    ...         0.0       0.00   \n","inu.30000082136767 00000056     0.00     0.06    ...         0.0       0.00   \n","mdp.39015017656870 00000103     0.00     0.00    ...         0.0       0.15   \n","mdp.39015034512692 00000105     0.03     0.00    ...         0.0       0.04   \n","mdp.39015082679633 00000041     0.04     0.00    ...         0.0       0.12   \n","\n","                             t75 virus  t76 require  t77 drink  t78 clan  \\\n","mdp.39015020711092 00000296       0.00         0.02       0.06       0.0   \n","mdp.49015002042985 00000127       0.00         0.00       0.00       0.0   \n","uc1.32106002188628 00000300       0.00         0.00       0.00       0.0   \n","mdp.39015021474286 00000223       0.01         0.00       0.00       0.0   \n","mdp.39015005343515 00000218       0.00         0.00       0.01       0.0   \n","mdp.39015034647167 00000091       0.00         0.00       0.00       0.0   \n","inu.30000082136767 00000056       0.00         0.00       0.01       0.0   \n","mdp.39015017656870 00000103       0.00         0.00       0.00       0.0   \n","mdp.39015034512692 00000105       0.00         0.01       0.00       0.0   \n","mdp.39015082679633 00000041       0.00         0.01       0.00       0.0   \n","\n","                             t79 mile  dominant_topic                htid  \\\n","mdp.39015020711092 00000296      0.04              43  mdp.39015020711092   \n","mdp.49015002042985 00000127      0.00              11  mdp.49015002042985   
\n","uc1.32106002188628 00000300      0.00              43  uc1.32106002188628   \n","mdp.39015021474286 00000223      0.06              43  mdp.39015021474286   \n","mdp.39015005343515 00000218      0.00               7  mdp.39015005343515   \n","mdp.39015034647167 00000091      0.00              28  mdp.39015034647167   \n","inu.30000082136767 00000056      0.00              48  inu.30000082136767   \n","mdp.39015017656870 00000103      0.00              48  mdp.39015017656870   \n","mdp.39015034512692 00000105      0.00              43  mdp.39015034512692   \n","mdp.39015082679633 00000041      0.00              60  mdp.39015082679633   \n","\n","                                 page  \n","mdp.39015020711092 00000296  00000296  \n","mdp.49015002042985 00000127  00000127  \n","uc1.32106002188628 00000300  00000300  \n","mdp.39015021474286 00000223  00000223  \n","mdp.39015005343515 00000218  00000218  \n","mdp.39015034647167 00000091  00000091  \n","inu.30000082136767 00000056  00000056  \n","mdp.39015017656870 00000103  00000103  \n","mdp.39015034512692 00000105  00000105  \n","mdp.39015082679633 00000041  00000041  \n","\n","[10 rows x 83 columns]"]},"metadata":{"tags":[]}}]},{"cell_type":"code","metadata":{"id":"Nb0q8U-qRlXj","colab_type":"code","colab":{}},"source":["# Most frequently dominant topics\n","dtm.dominant_topic.value_counts()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"qh7cQ3c1_guK","colab_type":"text"},"source":["## Storing the results"]},{"cell_type":"markdown","metadata":{"id":"k52VQsTQRlYD","colab_type":"text"},"source":["### Visualization\n","\n","Compute, display and save a handy HTML visualization of the topic model output."]},{"cell_type":"code","metadata":{"id":"7LR8M7sNRlYF","colab_type":"code","outputId":"c2deef56-bb08-4bbe-baa0-02c313c42ecb","colab":{}},"source":["%%time\n","import pyLDAvis\n","import pyLDAvis.sklearn\n","visdata = pyLDAvis.sklearn.prepare(lda, tf, 
vectorizer)\n","pyLDAvis.save_html(visdata, os.path.join(resultsDir, 'pyLDAvis-80_10000_0.30.html'))"],"execution_count":0,"outputs":[{"output_type":"stream","text":["CPU times: user 1min 18s, sys: 1.68 s, total: 1min 20s\n","Wall time: 1min 27s\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"8n_Bx-taQO8d","colab_type":"text"},"source":["### Save doc-topic matrix to CSV file"]},{"cell_type":"code","metadata":{"scrolled":true,"id":"VH2sRcqDRlXq","colab_type":"code","colab":{}},"source":["# Group the page-level rows by HTID and average them to get each novel's mean topic proportions,\n","# then save the result to a CSV file (numeric_only skips the string 'page' column)\n","dtm.groupby('htid').mean(numeric_only=True).to_csv(os.path.join(resultsDir, 'lda_80_10000_0.30_htid.csv'))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"HIN5Cvd5Pn4x","colab_type":"text"},"source":["## 5. Highest scoring books for city topic\n","\n","Retrieve the metadata for the books with the highest proportion of the city topic. Topic 19 in the HTML visualization was identified as the city topic, since it is the topic most concerned with city space. It is selected below through the column `t36 city` of the saved doc-topic matrix; the two numbers differ because pyLDAvis, by default, renumbers topics by prevalence rather than keeping the model's topic indices.\n","\n","Code by Federica Bologna"]},{"cell_type":"code","metadata":{"id":"rugttHNoPoYK","colab_type":"code","colab":{}},"source":["citytopic_htids = pd.read_csv(\"/content/drive/My Drive/Università/3 ANNO MAGISTRALE/TESI/3_topicmodeling/lda_80_10000_0.30_htid.csv\")[[\"htid\", \"t36 city\"]]"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"QioJBo6jx3m9","colab_type":"code","colab":{}},"source":["titles = pd.read_csv(\"/content/drive/My Drive/Università/3 ANNO MAGISTRALE/TESI/2_terms/scifi_metadata_htids.csv\")[[\"htid\", \"title\", \"date\"]]"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"s-zvKlfox2in","colab_type":"code","colab":{}},"source":["titles[\"decade\"] = 
(titles[\"date\"]//10)*10"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"vYxTPnWFyUBJ","colab_type":"code","colab":{}},"source":["citytopic_books = pd.merge(left=titles, right=citytopic_htids, on=\"htid\").sort_values(\"t36 city\", ascending=False)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"bv1JTKsJyiUh","colab_type":"code","outputId":"98610979-60bc-427f-8ab5-a88261fbf95b","executionInfo":{"status":"ok","timestamp":1577642849023,"user_tz":-60,"elapsed":585,"user":{"displayName":"Federica Bologna","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mAbXMmexGk_lRJUtqkV2mkHhg7CLpyfyXZzkkac=s64","userId":"07332947698373575453"}},"colab":{"base_uri":"https://localhost:8080/","height":669}},"source":["citytopic_books[:20]"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>htid</th>\n","      <th>title</th>\n","      <th>date</th>\n","      <th>decade</th>\n","      <th>t36 city</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>19</th>\n","      <td>hvd.32044021192513</td>\n","      <td>The chessmen of Mars</td>\n","      <td>1922</td>\n","      <td>1920</td>\n","      <td>0.075039</td>\n","    </tr>\n","    <tr>\n","      <th>180</th>\n","      <td>mdp.39015056811675</td>\n","      <td>Inverted World</td>\n","      <td>1974</td>\n","      <td>1970</td>\n","      <td>0.068370</td>\n","    </tr>\n","    <tr>\n","      <th>15</th>\n","      <td>uc2.ark:/13960/t2794143s</td>\n","      <td>A princess of Mars</td>\n","      <td>1917</td>\n","   
   <td>1910</td>\n","      <td>0.065164</td>\n","    </tr>\n","    <tr>\n","      <th>61</th>\n","      <td>mdp.39015056438271</td>\n","      <td>The shadow girl</td>\n","      <td>1946</td>\n","      <td>1940</td>\n","      <td>0.061538</td>\n","    </tr>\n","    <tr>\n","      <th>186</th>\n","      <td>mdp.39015062115814</td>\n","      <td>Ragtime</td>\n","      <td>1975</td>\n","      <td>1970</td>\n","      <td>0.057273</td>\n","    </tr>\n","    <tr>\n","      <th>35</th>\n","      <td>uc1.b3184615</td>\n","      <td>Carson of Venus</td>\n","      <td>1938</td>\n","      <td>1930</td>\n","      <td>0.055099</td>\n","    </tr>\n","    <tr>\n","      <th>167</th>\n","      <td>mdp.39015008587365</td>\n","      <td>The Iron Dream</td>\n","      <td>1972</td>\n","      <td>1970</td>\n","      <td>0.052903</td>\n","    </tr>\n","    <tr>\n","      <th>9</th>\n","      <td>dul1.ark:/13960/t1qf9cr59</td>\n","      <td>A time of terror</td>\n","      <td>1906</td>\n","      <td>1900</td>\n","      <td>0.051703</td>\n","    </tr>\n","    <tr>\n","      <th>125</th>\n","      <td>mdp.39015016450234</td>\n","      <td>The Squares of the City</td>\n","      <td>1965</td>\n","      <td>1960</td>\n","      <td>0.048525</td>\n","    </tr>\n","    <tr>\n","      <th>317</th>\n","      <td>mdp.39015054285773</td>\n","      <td>The Years of Rice and Salt</td>\n","      <td>2002</td>\n","      <td>2000</td>\n","      <td>0.040522</td>\n","    </tr>\n","    <tr>\n","      <th>14</th>\n","      <td>mdp.39015001553547</td>\n","      <td>The scarlet plague</td>\n","      <td>1915</td>\n","      <td>1910</td>\n","      <td>0.040000</td>\n","    </tr>\n","    <tr>\n","      <th>28</th>\n","      <td>uc1.32106002143938</td>\n","      <td>Black no more</td>\n","      <td>1931</td>\n","      <td>1930</td>\n","      <td>0.038776</td>\n","    </tr>\n","    <tr>\n","      <th>224</th>\n","      <td>mdp.39015020711092</td>\n","      <td>Lord Valentine's Castle</td>\n","      
<td>1980</td>\n","      <td>1980</td>\n","      <td>0.037884</td>\n","    </tr>\n","    <tr>\n","      <th>44</th>\n","      <td>mdp.39015002143181</td>\n","      <td>The Reign of Wizardry</td>\n","      <td>1940</td>\n","      <td>1940</td>\n","      <td>0.036703</td>\n","    </tr>\n","    <tr>\n","      <th>0</th>\n","      <td>nyp.33433076024060</td>\n","      <td>The secret of the crater</td>\n","      <td>1900</td>\n","      <td>1900</td>\n","      <td>0.036156</td>\n","    </tr>\n","    <tr>\n","      <th>330</th>\n","      <td>pst.000067168613</td>\n","      <td>The Windup Girl</td>\n","      <td>2009</td>\n","      <td>2000</td>\n","      <td>0.035960</td>\n","    </tr>\n","    <tr>\n","      <th>134</th>\n","      <td>mdp.39015008978879</td>\n","      <td>Lord of Light</td>\n","      <td>1967</td>\n","      <td>1960</td>\n","      <td>0.035803</td>\n","    </tr>\n","    <tr>\n","      <th>5</th>\n","      <td>nyp.33433074850870</td>\n","      <td>The Princess Thora</td>\n","      <td>1904</td>\n","      <td>1900</td>\n","      <td>0.035503</td>\n","    </tr>\n","    <tr>\n","      <th>200</th>\n","      <td>uc1.b4430797</td>\n","      <td>Cirque</td>\n","      <td>1977</td>\n","      <td>1970</td>\n","      <td>0.033966</td>\n","    </tr>\n","    <tr>\n","      <th>324</th>\n","      <td>inu.30000107441606</td>\n","      <td>Going Postal</td>\n","      <td>2005</td>\n","      <td>2000</td>\n","      <td>0.033839</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["                          htid                       title  ...  decade  t36 city\n","19          hvd.32044021192513        The chessmen of Mars  ...    1920  0.075039\n","180         mdp.39015056811675              Inverted World  ...    1970  0.068370\n","15    uc2.ark:/13960/t2794143s          A princess of Mars  ...    1910  0.065164\n","61          mdp.39015056438271             The shadow girl  ...    
1940  0.061538\n","186         mdp.39015062115814                     Ragtime  ...    1970  0.057273\n","35                uc1.b3184615             Carson of Venus  ...    1930  0.055099\n","167         mdp.39015008587365              The Iron Dream  ...    1970  0.052903\n","9    dul1.ark:/13960/t1qf9cr59            A time of terror  ...    1900  0.051703\n","125         mdp.39015016450234     The Squares of the City  ...    1960  0.048525\n","317         mdp.39015054285773  The Years of Rice and Salt  ...    2000  0.040522\n","14          mdp.39015001553547          The scarlet plague  ...    1910  0.040000\n","28          uc1.32106002143938               Black no more  ...    1930  0.038776\n","224         mdp.39015020711092     Lord Valentine's Castle  ...    1980  0.037884\n","44          mdp.39015002143181       The Reign of Wizardry  ...    1940  0.036703\n","0           nyp.33433076024060    The secret of the crater  ...    1900  0.036156\n","330           pst.000067168613             The Windup Girl  ...    2000  0.035960\n","134         mdp.39015008978879               Lord of Light  ...    1960  0.035803\n","5           nyp.33433074850870          The Princess Thora  ...    1900  0.035503\n","200               uc1.b4430797                      Cirque  ...    1970  0.033966\n","324         inu.30000107441606                Going Postal  ...    
2000  0.033839\n","\n","[20 rows x 5 columns]"]},"metadata":{"tags":[]},"execution_count":19}]},{"cell_type":"code","metadata":{"id":"gUvfPTCa0qU8","colab_type":"code","outputId":"fffd23ca-9960-4f95-9f55-e512f54404e2","executionInfo":{"status":"ok","timestamp":1577642856963,"user_tz":-60,"elapsed":722,"user":{"displayName":"Federica Bologna","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mAbXMmexGk_lRJUtqkV2mkHhg7CLpyfyXZzkkac=s64","userId":"07332947698373575453"}},"colab":{"base_uri":"https://localhost:8080/","height":204}},"source":["citytopic_books[:20].groupby(\"decade\").count().htid"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["decade\n","1900    3\n","1910    2\n","1920    1\n","1930    2\n","1940    2\n","1960    2\n","1970    4\n","1980    1\n","2000    3\n","Name: htid, dtype: int64"]},"metadata":{"tags":[]},"execution_count":20}]},{"cell_type":"code","metadata":{"id":"1SA6MNH-zP20","colab_type":"code","colab":{}},"source":["citytopic_books.to_csv(\"/content/drive/My Drive/Università/3 ANNO MAGISTRALE/TESI/3_topicmodeling/citytopic_books.csv\")"],"execution_count":0,"outputs":[]}]}